[CONTP-1758] Improve DatadogGenericResource reconciliation at scale #3143

Open
tbavelier wants to merge 12 commits into main from tbavelier/thousands-of-ddgrs

tbavelier commented Jun 15, 2026

Member

What does this PR do?

Improves DatadogGenericResource (DDGR) reconciliation behavior at high resource counts:

  • adds configurable DatadogGenericResource controller concurrency with --datadogGenericResourceMaxConcurrentReconciles
  • adds configurable DDGR status polling cadence with --datadogGenericResourceRequeuePeriod and DD_GENERIC_RESOURCE_REQUEUE_PERIOD
  • marks DDGR live-state polling requeues as low priority so create/update/delete work can run ahead of status refresh backlog

Motivation

At high custom resource (CR) counts, periodic status polling can dominate the controller queue and delay user-facing create/update/delete operations. This PR makes DDGR status polling tunable and lower priority, while normal reconciliation and backend error retries stay at regular priority.

This should help in situations like #2816.

Additional Notes

The low-priority path is limited to DDGR refreshState reconciles, which currently applies to resource types that fetch backend state, such as monitors and SLOs. It intentionally does not lower the priority of create/update/delete retries.
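
The ordering effect described above can be illustrated with a small, self-contained Go sketch. This is a toy priority queue built on `container/heap`, not the controller's actual work queue: the `workQueue` type, the `regularPriority` value of 0, and the item keys are assumptions for illustration. Only the low-priority value of -100 is taken from the metric label quoted in this PR.

```go
package main

import (
	"container/heap"
	"fmt"
)

const (
	regularPriority = 0    // assumed priority for create/update/delete work
	lowPriority     = -100 // matches the priority="-100" metric label in this PR
)

// item is one queued reconcile request.
type item struct {
	key      string
	priority int
	seq      int // insertion order, keeps FIFO order within one priority
}

// itemHeap orders items by priority (highest first), then by insertion order.
type itemHeap []item

func (h itemHeap) Len() int { return len(h) }
func (h itemHeap) Less(i, j int) bool {
	if h[i].priority != h[j].priority {
		return h[i].priority > h[j].priority
	}
	return h[i].seq < h[j].seq
}
func (h itemHeap) Swap(i, j int) { h[i], h[j] = h[j], h[i] }
func (h *itemHeap) Push(x any)   { *h = append(*h, x.(item)) }
func (h *itemHeap) Pop() any {
	old := *h
	last := old[len(old)-1]
	*h = old[:len(old)-1]
	return last
}

// workQueue is a hypothetical stand-in for a priority-aware work queue.
type workQueue struct {
	items itemHeap
	next  int
}

// add enqueues a key with the given priority.
func (w *workQueue) add(key string, priority int) {
	heap.Push(&w.items, item{key: key, priority: priority, seq: w.next})
	w.next++
}

// drain pops every item and returns the keys in processing order.
func (w *workQueue) drain() []string {
	var order []string
	for w.items.Len() > 0 {
		order = append(order, heap.Pop(&w.items).(item).key)
	}
	return order
}

func main() {
	q := &workQueue{}
	// A backlog of status refreshes is already waiting at low priority.
	q.add("monitor-1", lowPriority)
	q.add("monitor-2", lowPriority)
	q.add("monitor-3", lowPriority)
	// A newly created resource arrives later at regular priority.
	q.add("new-dashboard", regularPriority)
	fmt.Println(q.drain()) // [new-dashboard monitor-1 monitor-2 monitor-3]
}
```

The newly created item is processed first even though it was enqueued last, which is the behavior the manual validation below checks for.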

Validated manually that workqueue_depth{controller="DatadogGenericResource", priority="-100"} exposes the low-priority queue depth for status polling requeues. Created 1k DDGRs with 10 concurrent reconciles, then switched to 1 concurrent reconcile so that a status-update backlog accumulated in the low-priority queue. Created an additional DDGR and verified that it was processed ahead of the backlog (it was synced with the backend).

[Image: the low-priority queue shows a backlog while the regular-priority queue stays healthy.]

Minimum Agent Versions

No minimum Agent or Cluster Agent version changes.

  • Agent: N/A
  • Cluster Agent: N/A

Describe your test plan

Two things to test separately; the first numbered list below covers the first bullet and the second list covers the second:

  • --datadogGenericResourceMaxConcurrentReconciles and low priority
  • DD_GENERIC_RESOURCE_REQUEUE_PERIOD
  1. Deploy operator with DDGR enabled and --datadogGenericResourceMaxConcurrentReconciles=10
  2. Using a script, create 1k DDGRs at once. The operator should work through the queue in a few minutes (each DDGR should get an ID)
  3. Change --datadogGenericResourceMaxConcurrentReconciles back to 1 (or remove the flag, since 1 is the default) and wait a minute or two while the new operator pod acquires the lease (or delete the lease directly and restart the pod)
  4. A backlog should accumulate in the low-priority queue (status updates), since the operator can't keep up with 1k resources that were synced around the same time
  5. Create a new DDGR, e.g. k apply -f examples/datadoggenericresource/dashboard-sample.yaml
  6. Ensure it is synced ahead of the status updates: it should get an ID
  7. Delete it with k delete ddgr ddgr-dashboard-sample and, once again, ensure it is actually deleted
  8. Change back to --datadogGenericResourceMaxConcurrentReconciles=10 and clean up all the resources (again with a script)

  1. Deploy operator with DDGR enabled and DD_GENERIC_RESOURCE_REQUEUE_PERIOD env var set to 5m
  2. Create a few DDGR monitors and verify that the first state sync happens after 5 minutes
    ╰─❯ k get ddgr -w
    NAME                        ID          SYNC STATUS   STATE   LAST STATE SYNC   AGE
    ddgr-load-test-monitor-1    296107688   OK                                      3s
    ddgr-load-test-monitor-10   296107700   OK                                      3s
    ddgr-load-test-monitor-2    296107690   OK                                      3s
    ddgr-load-test-monitor-3    296107693   OK                                      3s
    ddgr-load-test-monitor-4    296107696   OK                                      3s
    ddgr-load-test-monitor-5    296107694   OK                                      3s
    ddgr-load-test-monitor-6    296107695   OK                                      3s
    ddgr-load-test-monitor-7    296107697   OK                                      3s
    ddgr-load-test-monitor-8    296107701   OK                                      3s
    ddgr-load-test-monitor-9    296107699   OK                                      3s
    ddgr-load-test-monitor-1    296107688   OK            OK      2026-06-16T09:51:01Z   5m1s
    ddgr-load-test-monitor-2    296107690   OK            OK      2026-06-16T09:51:02Z   5m1s
    ddgr-load-test-monitor-3    296107693   OK            OK      2026-06-16T09:51:02Z   5m1s
    ddgr-load-test-monitor-7    296107697   OK            OK      2026-06-16T09:51:02Z   5m1s
    ddgr-load-test-monitor-4    296107696   OK            OK      2026-06-16T09:51:02Z   5m1s
    ddgr-load-test-monitor-6    296107695   OK            OK      2026-06-16T09:51:02Z   5m1s
    ddgr-load-test-monitor-9    296107699   OK            OK      2026-06-16T09:51:02Z   5m1s
    ddgr-load-test-monitor-10   296107700   OK            OK      2026-06-16T09:51:02Z   5m1s
    ddgr-load-test-monitor-8    296107701   OK            OK      2026-06-16T09:51:02Z   5m1s
    ddgr-load-test-monitor-5    296107694   OK            OK      2026-06-16T09:51:02Z   5m1s
  3. Change it to a value that is not a valid duration (e.g. foo)
  4. Verify that the error log below is present and that resources are synced every 60s (the default)
    {"level":"ERROR","ts":"2026-06-16T09:59:18.652Z","logger":"controllers.DatadogGenericResource","msg":"Invalid value for generic resource requeue period. Defaulting to 60 seconds.","error":"time: invalid duration \"foo\"","stacktrace":"github.com/DataDog/datadog-operator/internal/controller/datadoggenericresource.requeuePeriodFromEnv\n\t/workspace/internal/controller/datadoggenericresource/controller.go:91\ngithub.com/DataDog/datadog-operator/internal/controller/datadoggenericresource.requeuePeriod\n\t/workspace/internal/controller/datadoggenericresource/controller.go:82\ngithub.com/DataDog/datadog-operator/internal/controller/datadoggenericresource.NewReconciler\n\t/workspace/internal/controller/datadoggenericresource/controller.go:69\ngithub.com/DataDog/datadog-operator/internal/controller.(*DatadogGenericResourceReconciler).SetupWithManager\n\t/workspace/internal/controller/datadoggenericresource_controller.go:52\ngithub.com/DataDog/datadog-operator/internal/controller.startDatadogGenericResource\n\t/workspace/internal/controller/setup.go:234\ngithub.com/DataDog/datadog-operator/internal/controller.SetupControllers\n\t/workspace/internal/controller/setup.go:100\nmain.run\n\t/workspace/cmd/main.go:413\nmain.main\n\t/workspace/cmd/main.go:223\nruntime.main\n\t/usr/local/go/src/runtime/proc.go:285"}
    ╰─❯ k get ddgr -w
    NAME                        ID          SYNC STATUS   STATE   LAST STATE SYNC        AGE
    ddgr-load-test-monitor-1    296107688   OK            OK      2026-06-16T09:59:18Z   14m
    ddgr-load-test-monitor-10   296107700   OK            OK      2026-06-16T09:59:18Z   14m
    ddgr-load-test-monitor-2    296107690   OK            OK      2026-06-16T09:59:18Z   14m
    ddgr-load-test-monitor-3    296107693   OK            OK      2026-06-16T09:59:18Z   14m
    ddgr-load-test-monitor-4    296107696   OK            OK      2026-06-16T09:59:18Z   14m
    ddgr-load-test-monitor-5    296107694   OK            OK      2026-06-16T09:59:18Z   14m
    ddgr-load-test-monitor-6    296107695   OK            OK      2026-06-16T09:59:18Z   14m
    ddgr-load-test-monitor-7    296107697   OK            OK      2026-06-16T09:59:18Z   14m
    ddgr-load-test-monitor-8    296107701   OK            OK      2026-06-16T09:59:18Z   14m
    ddgr-load-test-monitor-9    296107699   OK            OK      2026-06-16T09:59:18Z   14m
    ddgr-load-test-monitor-4    296107696   OK            OK      2026-06-16T10:00:19Z   14m
    ddgr-load-test-monitor-8    296107701   OK            OK      2026-06-16T10:00:19Z   14m
    ddgr-load-test-monitor-10   296107700   OK            OK      2026-06-16T10:00:19Z   14m
    ddgr-load-test-monitor-2    296107690   OK            OK      2026-06-16T10:00:19Z   14m
    ddgr-load-test-monitor-1    296107688   OK            OK      2026-06-16T10:00:19Z   14m
    ddgr-load-test-monitor-7    296107697   OK            OK      2026-06-16T10:00:19Z   14m
    ddgr-load-test-monitor-5    296107694   OK            OK      2026-06-16T10:00:19Z   14m
    ddgr-load-test-monitor-3    296107693   OK            OK      2026-06-16T10:00:20Z   14m
    ddgr-load-test-monitor-9    296107699   OK            OK      2026-06-16T10:00:20Z   14m
    ddgr-load-test-monitor-6    296107695   OK            OK      2026-06-16T10:00:19Z   14m

Checklist

  • PR has at least one valid label: bug, enhancement, refactoring, documentation, tooling, and/or dependencies
  • PR has a milestone or the qa/skip-qa label
  • All commits are signed (see: signing commits)

tbavelier added the enhancement (New feature or request) label Jun 15, 2026
tbavelier added this to the v1.29.0 milestone Jun 15, 2026

datadog-datadog-prod-us1-2 Bot commented Jun 15, 2026


Code Coverage

🛑 Gate Violations

🎯 1 Code Coverage issue detected

A Patch coverage percentage gate may be blocking this PR.

Patch coverage: 61.70% (threshold: 80.00%)

ℹ️ Info

🎯 Code Coverage (details)
Patch Coverage: 61.70%
Overall Coverage: 44.96% (+0.92%)

🔗 Commit SHA: f2af21b

codecov-commenter commented Jun 15, 2026


Codecov Report

❌ Patch coverage is 64.70588% with 18 lines in your changes missing coverage. Please review.
✅ Project coverage is 44.73%. Comparing base (72bc0a0) to head (f2af21b).
⚠️ Report is 2 commits behind head on main.

Files with missing lines                                Patch %   Lines
...al/controller/datadoggenericresource_controller.go   0.00%     6 Missing ⚠️
cmd/main.go                                             0.00%     4 Missing ⚠️
...al/controller/datadoggenericresource/controller.go   89.18%    3 Missing and 1 partial ⚠️
internal/controller/setup.go                            0.00%     4 Missing ⚠️
Additional details and impacted files

@@            Coverage Diff             @@
##             main    #3143      +/-   ##
==========================================
+ Coverage   43.79%   44.73%   +0.93%     
==========================================
  Files         375      377       +2     
  Lines       30575    31508     +933     
==========================================
+ Hits        13390    14094     +704     
- Misses      16276    16489     +213     
- Partials      909      925      +16     
Flag Coverage Δ
unittests 44.73% <64.70%> (+0.93%) ⬆️

Files with missing lines                                Coverage Δ
cmd/main.go                                             6.60% <0.00%> (-0.08%) ⬇️
...al/controller/datadoggenericresource/controller.go   86.18% <89.18%> (+7.37%) ⬆️
internal/controller/setup.go                            71.71% <0.00%> (-1.94%) ⬇️
...al/controller/datadoggenericresource_controller.go   0.00% <0.00%> (ø)

... and 13 files with indirect coverage changes


Legend:
Δ = absolute <relative> (impact), ø = not affected, ? = missing data

tbavelier force-pushed the tbavelier/thousands-of-ddgrs branch from d08e2b9 to b6051a5 June 15, 2026 15:10
tbavelier force-pushed the tbavelier/thousands-of-ddgrs branch from b6051a5 to 29c92c9 June 16, 2026 09:05
tbavelier changed the title from "Improve DatadogGenericResource reconciliation at scale" to "[CONTP-1758] Improve DatadogGenericResource reconciliation at scale" Jun 16, 2026
tbavelier force-pushed the tbavelier/thousands-of-ddgrs branch from d4a9c5c to 136adf5 June 16, 2026 09:58
tbavelier marked this pull request as ready for review June 16, 2026 10:03
tbavelier requested a review from a team June 16, 2026 10:03
tbavelier requested a review from a team as a code owner June 16, 2026 10:03

drichards-87 left a comment

Contributor

Left some suggestions from Docs and approved the PR.

Comment threads on docs/datadog_generic_resource.md: 6, all outdated
tbavelier and others added 6 commits June 17, 2026 11:08
Co-authored-by: DeForest Richards <56796055+drichards-87@users.noreply.github.com> (listed 6 times, once per commit)
3 participants